Data Setup

Methods

Running Code

Running the SOM clustering procedure with our given parameters.

#Code using kohonen package instead of sombrero
gridsize=20
num_clusters=8

set.seed(123)
#Uncomment the below lines to be able to generate a map de novo
# som.grid <- somgrid(xdim=gridsize,ydim=gridsize,topo="hexagonal",toroidal=TRUE)
# som.model <- som(metabolite_matrix,grid=som.grid,rlen=1500,alpha=c(0.025,0.01))

#Comment this line out if you plan to generate a map de novo
load("Data/final-kohonen-som.RData")

#This has been an issue but we need to set the specific seed present *after* running the original SOM model to be able to reproduce the exact results
.Random.seed <- good.seed

#Plot the training progress to see and also to ensure right parameters were used
# plot(som.model,type="changes")

carve.som <- som.model$codes[[1]]
som.dist <- as.matrix(dist(carve.som))

try.k <- 2:20

#Load the pre-saved cluster data
load("Data/final-som-clusters.RData")
k <- num_clusters

#There's a nebulous issue with defining seeds while generating a markdown file, so at the moment we are loading a pre-saved data object containing the relevant cluster definitions using the correct seeds. You can uncomment this if you wish to run the chunks directly.

# cluster.dist.eval <- as.data.frame(matrix(ncol = 3, nrow = (length(try.k))))
# colnames(cluster.dist.eval) <- c('k', 'kmeans', 'hclust')
# 
# for(i in 1:length(try.k)) {
#   cluster.dist.eval[i, 'k'] <- try.k[i]
#   cluster.dist.eval[i, 'kmeans'] <- clusterMeanDist(kmeans(carve.som, centers = try.k[i], iter.max = 20)$cluster)
#   cluster.dist.eval[i, 'hclust'] <- clusterMeanDist(cutree(hclust(vegdist(carve.som,method="euclidean")), k = try.k[i]))
# }

#Plot the kmeans distances as a function of the number of clusters for both clustering methodologies
plot(cluster.dist.eval[, 'kmeans'] ~ try.k,
     type = 'l')

lines(cluster.dist.eval[, 'hclust'] ~ try.k,
      col = 'red')

legend('topright',
       legend = c('k-means', 'hierarchical'),
       col = c('black', 'red'),
       lty = c(1, 1))

# cluster.tries <- list()
# 
# for(k in num_clusters){
# 
#   ## k-means clustering
# 
#   som.cluster.k <- kmeans(carve.som, centers = k, iter.max = 10000, nstart = 10)$cluster # k-means
# 
#   ## hierarchical clustering
# 
#   som.dist <- dist(carve.som) # hierarchical, step 1
#   som.cluster.h <- cutree(hclust(som.dist), k = k) # hierarchical, step 2
# 
#   ## capture outputs
# 
#   cluster.tries[[paste0('som.cluster.k.', k)]] <- som.cluster.k
#   cluster.tries[[paste0('som.cluster.h.', k)]] <- som.cluster.h
# }

obs_density <- data.frame(node=som.model$unit.classif)%>%group_by(node)%>%summarise(count=n())
# ggplot(obs_density,aes(x=count))+geom_histogram()+custom_theme

#Want to find the average variance and number of models represented in each cluster
clusters <- som.cluster.k
ids <- som.model$unit.classif
sample_clusters <- som.cluster.k[som.model$unit.classif]

na.nodes <- which(c(1:gridsize^2)%in%unique(obs_density$node)==FALSE)
som.cluster.k[na.nodes] <- NA

all_vars <- c()
for (i in 1:k){
  curr_clust=i
  
  curr_models <- which(clusters==curr_clust)
  
  if (length(curr_models)>1){
    curr_vars <- apply(metabolite_matrix[curr_models,],2,sd)%>%mean()
    all_vars <- c(all_vars,curr_vars)
  } else {
    all_vars <- c(all_vars,0)
  }
}

# var_df <- data.frame(clust_size=table(sample_clusters),mean_var=all_vars)
# colnames(var_df) <- c("cluster","num_models","mean_var")

# ggplot(var_df,aes(x=num_models,y=mean_var,color=cluster))+geom_point(size=5)+labs(color="Cluster",x="Number of models in cluster",y="Average variance in cluster")+scale_color_manual(values = c("#a0a052","#6c63d7","#a0b533","#c957ca","#5db855","#98459e","#d89d37","#7a93dd","#c9532a","#46aed7","#d6445a","#50b593","#d1438d","#4c7937","#c285de","#91682d","#625fa3","#dd8a68","#d883af","#a44d61"))

SOM cluster analysis… done!

## [1] "There were 578 genomes with duplicate growth sensitivity vectors across all models generated for that genome."
## [1] "MARD_SAMN05216575_REFG_MMP05216575 has 2 SOM clusters with 29 models each."
## [1] "MARD_SAMN07327725_REFG_MMP07327725 has 2 SOM grid points with 18 models each."
## [1] "MARD_SAMN08912472_REFG_MMP08912472 has 2 SOM grid points with 21 models each."

Tree generation… done!

Metabolite pattern analysis… done!

Phylogenetic analysis… done!

Growth estimate (dCUB) analysis… done!

## [1] "Results of non-parametric Kruskal-Wallis Test:"
## 
##  Kruskal-Wallis rank sum test
## 
## data:  kruskal.list
## Kruskal-Wallis chi-squared = 128.06, df = 7, p-value < 2.2e-16

Comparison to Matti’s data… done!

## [1] "Metabolite sarcosine has no matches!"
## [1] "Metabolite taurine has no matches!"
## [1] "Metabolite hydroxyproline has no matches!"
## [1] "Metabolite oxalacetate has no matches!"
## [1] "Metabolite valerate has no matches!"
## [1] "Metabolite malonate has no matches!"
## [1] "Metabolite 3m2-oxybutyrate has no matches!"
## [1] "Metabolite gluconate lactone has no matches!"
## [1] "Metabolite glucuronate lactone has no matches!"
## [1] "Metabolite glcnac has no matches!"
## [1] "Metabolite galnac has no matches!"
## [1] "Metabolite melibiose has no matches!"
## [1] "Metabolite lactulose has no matches!"
## [1] "Metabolite alpha-cyclodextrin has no matches!"
## [1] "Metabolite erythrose has no matches!"

Read recruitment analysis… done!

Table Generation

Results

Data phylogeny and model consensus

Figure 1: Diversity of dataset, quality of metabolic models, and designation of metabolic clusters. Phylogenetic tree of all 3,984 bacterial genomes included in this study (including the 66 reference genomes from the BiGG database). The tree is contextualized by several external rings that describe different qualitative and quantitative components of the genomes in this study. The first ring around the tree denotes both the position and density of high quality ensembles within the tree as well as the assignment of these genomes to each of our eight SOM clusters. The second ring shows the ensemble consensus score (Equation 1) for each genome in the tree. The third, sparse ring of red lines denotes the position of the 66 BiGG reference genomes present in the tree. Finally, the fourth and innermost ring shows the location of the top 15 most abundant orders.

Figure 2: Substrate sensitivities for 8 SOM clusters. Bubble plot of the mean growth sensitivity values for genomes in each of our 8 SOM clusters. A growth sensitivity of 1 indicates high sensitivity to that substrate such that the modeled growth rate was reduced proportionally to the reduction in the substrate’s flux (e.g., 50% substrate reduction corresponded to 50% growth rate reduction). The size of the bubbles in this plot reflect the relative sensitivity of each of the 8 SOM clusters to a given compound class where larger bubbles indicate that cluster was more sensitive to that compound class than others. The 6 compound classes which resulted in significant growth reduction for at least one of the SOM clusters are shown here. The full results for all 11 substrate classes are provided in Supplemental Figures S2 & S3). Cluster numbers were colored based on maximal genomic growth rate (Supplemental Figure S6).

Figure 3: Biogeographical relative abundances of 8 SOM clusters. Clustered bar charts of the relative abundances of the 8 SOM clusters as determined by RPKM at each of the 1,203 stations assigned to one of the 23 oceanographic regions. Stations were grouped into our 5 defined oceanographic categories and then arranged based on a hierarchical clustering of the relative abundances at the stations assigned to each of these 5 groups. These relative abundances further establish the validity of our predicted metabolic clusters. In accordance with the literature, fast growing copiotrophs dominate eutrophic environments like estuaries before being succeeded by slower growing specialists in low nutrient open ocean environments. The observed variability within each category also demonstrates the potential for environmental characteristics (e.g., nutrient concentrations) to dictate, at least in part, the local microbial community composition.

Supplemental Figures

Supplemental Figure S1: Bubble plot of the mean growth sensitivity values for two new sets of 6 SOM clusters, rather than 8, generated on the 1,591 genomes with a consensus of 80% or greater (top) and the 983 genomes above a more stringent threshold of 90% or greater (bottom). A growth sensitivity of 1 indicates high sensitivity to that substrate such that the modeled growth rate was reduced proportionally to the reduction in the substrate’s flux (e.g., 50% substrate reduction corresponded to 50% growth rate reduction). The size of the bubbles in this plot reflect the relative sensitivity of each of the 6 SOM clusters to a given compound class where larger bubbles indicate that cluster was more sensitive to that compound class than others. While the ordering of the clusters as well as the number of clusters of each type changes, we still observe the same overall patterns. We have one fast-growing cluster with no growth sensitivities, multiple clusters with sensitivity to one compound, and multiple with sensitivity to two compounds. The fast growth cluster and intermediate growth single sensitivity clusters from the primary analysis emerged in this higher consensus group of models. The shifts in sensitivities and reduction in number of clusters for the slow-growing clusters is consistent with the fact that the more classically oligotrophic orders generally had model ensembles with lower consensus values. Thus, increasing the consensus threshold is more likely to exclude these genomes from the SOM generation and would be expected to have the largest impact on these clusters.

Supplemental Figure S2: Distribution of growth sensitivity values by cluster. Density plots of the growth sensitivity values for each model for each of the 11 compound classes grouped by SOM cluster (N=1,050,060).

Supplemental Figure S3:Comparison of results between CarveMe model ensembles and experimental growth studies performed in Gralka et al. Heatmap of agreement between CarveMe model ensemble reactions and experimental growth data for a collection of 146 strains grown on 58 different sole carbon sources. The specific classification of each compound according to our manually curated categories is shown as a sequence of colored bars along the x-axis. When comparing the growth/no growth predictions between CarveMe and the experimental data, we looked for perfect agreement, as well as agreement when supplementing the CarveMe models with one (single rescue) or two (double rescue) non-Carbon compounds. Collectively, we considered these three cases (shades of blue in the heatmap) as agreement between the model and experimental data, or a true positive. The gray squares represent cases where the CarveMe model had the pathway to uptake the compound in question but did not demonstrate growth on it, suggesting the model may have been missing key pathways to catabolize that compound. The dark pink squares reflect the case where the experimental data predicted growth but the CarveMe model neither grew on that compound nor had the pathway to uptake that compound. The light gray and dark pink squares collectively represent the instances of false negatives in the comparison. Finally, the light pink squares represent the case where we predicted growth on a compound from the CarveMe models but those microbes showed no ability to grow on that compound in the experimental data (false positives).

Supplemental Figure SX: Bootstrap experiment of model accuracy to experimental data. We performed a bootstrap analysis to compare the accuracy (model agreement) of the media predictions from CarveMe models built on the strains from Gralka et al. to the accuracy we would achieve using a randomized data schema. In this analysis the binary growth/no growth outcomes were randomized such that the randomized data had the same number of positive growth signals for each of the genomes from the experimental study. This was then simulated 10,000 times to establish a distribution of accuracy values using random data. The mean value of these measured accuracy values is shown with the blue dashed line, while the data beyond 2 standard deviations in each direction is highlighted by the blue shaded areas of the Gaussian curve drawn over this distribution. Our real accuracy measurement is shown with the dashed orange line, sufficiently higher than the predicted mean accuracy with randomized data (note the break in the x-axis values).

Supplemental Figure S4: Relative growth sensitivities between SOM clusters. (a) Ordered bar plot of the proportion of models across all clusters with substantial growth sensitivity to the reduction of each compound class (substantial is defined as >80% reduction in growth). (b) Bar plots of the relative mean growth sensitivity values for each of the 11 compound classes across the 8 SOM clusters. The error bars represent one standard deviation. Plot facets are ordered from the highest overall sensitivity (carboxylic acids) to the lowest (amines/amides).

Supplemental Figure S5: Taxonomy by cluster. (a) Ordered bar plot of the proportion of models in each of the 15 most abundant orders with substantial growth sensitivity to the reduction of any compound class (substantial is defined as >80% reduction in growth). (b) Stacked bar plots of the relative abundances of the top 15 orders in each cluster.

Supplemental Figure S6: Codon usage bias (dCUB) by cluster. The dCUB values fall into four statistically distinct groups designated with letters according to the key. Statistical groupings were determined by an ANOVA test of the dCUB distributions using Tukey’s HSD with a 95% confidence interval. Group a is the slow-growing group (Clusters 1, 3, 7 and 8) and statistically distinct from the other clusters. Group c (Cluster 2) is the fast-growing group and is distinct from all other clusters. Groups ab and bc represent our intermediate growers. Specifically, group ab (Clusters 4 and 6) are statistically distinct from fast-growing group c but not from slow-growing group a, whereas group bc (Cluster 5) is statistically distinct from group a but not from group c.

Supplemental Figure SX:Bootstrap experiment of dCUB growth rates for randomized SOM clusters. Similarly to Supplemental Figure SXX, we performed a bootstrap analysis to compare the proportion of genomes below the critical dCUB threshold of -0.08 for clusters with randomly assigned genomes versus the real SOM clusters predicted by our method. The mean proportion from the bootstrapping analysis is shown in the dashed orange line. The area of the distribution beyond 2 standard deviations from this mean is highlighted in the blue shaded area of the Gaussian curve drawn over this distribution. The mean proportion of genomes below this threshold for our 8 SOM clusters are shown by the various dashed lines colored according to which cluster they represent. We see that our two intermediate growth clusters fall within our randomized distribution, which is unsurprising given that they represent an intermediate growth rate physiology. The 4 slow growing clusters fall more than 2 standard deviations below the bootstrap mean, indicating that those clusters grow more slowly than can be explained by random effects. Conversely, the fast and fast intermediate clusters fall more than 2 standard deviations above the bootstrap mean, confirming that on average they grow faster than we would expect from a random group of genomes.

Supplemental Figure S7: Taxonomic abundance enrichments of top 15 Orders by cluster. Percentage enrichment in the relative abundance of the top 15 Orders (and Other) in each of the 8 SOM clusters compared to the relative abundances of each of these Orders in the full dataset.

Supplemental Figure S8: Relative abundances of SOM clusters by oceanographic region. Sampling sites were grouped for bootstrapping according to the 23 oceanographic regions given in Table S3. For each region, bar plots of the average relative abundances of each of the 8 SOM clusters are shown. The relative abundances are calculated based on the bootstrap distributions of the raw RPKM values. The clusters are colored by their growth strategy (fast, fast-intermediate, slow-intermediate, and slow). Error bars represent the standard deviations of the bootstrapped distributions.

Supplemental Figure S9: Relative abundances of SOM clusters by oceanographic category. Sampling sites were grouped for bootstrapping according to the 5 oceanographic categories. Bar plots show the average relative abundances of each of the 8 SOM clusters in each category where the abundance is based on the bootstrap distributions of the raw RPKM values. The clusters are colored by their growth strategy (fast, fast-intermediate, slow-intermediate, and slow). Error bars represent the standard deviations of the bootstrapped distributions.

Supplemental Figure S11: SOM metrics. (a) The SOM grid is shown where each circle represents a grid point in the map (N=400). Grid points are colored by their assignment to the 8 defined SOM clusters. Grid points to which no genomes were assigned are colored gray to represent the absence of mapped data. It is important to note that the SOM uses a toroidal grid where the edges wrap around such that, for example, all nodes in Cluster 8 are in fact connected. (b) The training progress of the grid is shown for the duration of the map refinement process. (c) The number of models from each genome ensemble that were assigned to each SOM cluster, where a value of 60 denotes instances when all models from the ensemble were assigned to the same SOM cluster and a value of 0 denotes that no models from a specific ensemble were assigned to the cluster. Data is only shown for the 1,591 high consensus ensembles. The bimodal distribution of the data around 0 and 60 illustrates that all models from a given ensemble were almost always assigned to the same SOM cluster.

Supplemental Figure S12: PCA plot of the growth sensitivity data. The PCA captured 50.6% of the total variance on the first two principal component axes, and distinguished two major groups of data points, primarily separated by sensitivity to sugars (Carbohydrates/Derivatives) versus acids (Carboxylic Acids). Of note, the estimates of maximum growth rate were not included in this clustering. The points in the PCA are colored by SOM cluster assignment to illustrate that both approaches identified similar clustering of the data but that the SOM method was able to better differentiate between the 8 clusters.